Abstract. Query expansion is a well-known method for improving the
performance of information retrieval systems. In this work we have tested
different approaches to extracting candidate query terms from the top
ranked documents returned by the first-pass retrieval. One of them is
the cooccurrence approach, based on measures of cooccurrence of the
candidate and the query terms in the retrieved documents. The other
one, the probabilistic approach, is based on the probability distribution
of terms in the collection and in the top ranked set. We compare the
retrieval improvement achieved by expanding the query with terms
obtained with different methods belonging to both approaches. In addition,
we have developed a naive combination of both kinds of method, which
yields better results than either of them separately. This result
confirms that the information provided by each approach is of a different
nature and can therefore be used in a combined manner.



1 Introduction 

The reformulation of user queries is a common technique in information
retrieval to bridge the gap between the original user query and the user's
information need. The most widely used technique for query reformulation is
query expansion, where the original user query is expanded with new terms
extracted from different sources. Queries submitted by users are usually very
short, and query expansion can complete the users' information need.

A very complete review of the classical techniques of query expansion was
done by Efthimiadis [?]. The main problem of query expansion is that in some
cases the expansion process worsens the query performance. Improving the
robustness of query expansion has been the goal of many researchers in recent
years, and most proposed approaches use external collections [?], such as
Web documents, to extract candidate terms for the expansion. Other
methods extract the candidate terms from the same collection on which the
search is performed [?]. Some of these methods are based on global analysis,
where the list of candidate terms is generated from the whole collection, but
they are computationally very expensive and their effectiveness is no better
than that of methods based on local analysis. We also use the same collection
on which the search is performed, but we apply local query expansion, also
known as pseudo-feedback or blind feedback, which does not use the global
collection or external sources for the expansion. This approach was first
proposed by Xu and Croft and extracts the expansion terms from the documents
retrieved for the original user query in a first-pass retrieval.

In this work we have tested different approaches to extracting the candidate
terms from the top ranked documents returned by the first-pass retrieval.
There are two main approaches to ranking the terms extracted from the
retrieved documents. One of them is the cooccurrence approach, based on
measures of cooccurrence of the candidate and the query terms in the retrieved
documents. The other one is the probabilistic approach, based on the
differences between the probability distributions of terms in the collection
and in the top ranked set. In this paper we are interested in evaluating the
existing techniques for generating the candidate term list. Our thesis is
that the information obtained with the cooccurrence methods is different from
the information obtained with probabilistic methods, and that these two kinds
of information can be combined to improve the performance of the query
expansion process. Accordingly, our goal has been to compare the performance
of the cooccurrence and probabilistic techniques and to study how to combine
them to improve the query expansion process.

After the term extraction step, the query expansion process requires a further
step: recomputing the weights of the query terms that will be used in
the search process. We present the results of combining different methods for
the term extraction and reweighting steps.

Two important parameters have to be adjusted for the described process.
One of them is the number of documents retrieved in the first pass to be used
for the term extraction. The other is the number of candidate terms that are
finally used to expand the original user query. We have performed experiments
to set both of them to their optimal values for each method considered.

The rest of the paper proceeds as follows: Sections 2 and 3 describe the
cooccurrence and probabilistic approaches, respectively; Section 4 presents
our proposal for combining both approaches; Section 5 describes the different
reweighting methods considered for assigning new weights to the query terms
after the expansion process; Section 6 presents the experiments performed to
evaluate the different expansion techniques separately and combined; and
Section 7 summarizes the main conclusions of this work.

2 Cooccurrence Methods 

Methods based on term cooccurrence have been used since the 1970s to
identify some of the semantic relationships that exist among terms. In the
early works of K. Van Rijsbergen and K. Sparck Jones we find the idea of using
cooccurrence statistics to detect some kind of semantic similarity between
terms and using it to expand the user's queries. In fact, this idea is based
on the Association Hypothesis:

If an index term is good at discriminating relevant from non-relevant
documents then any closely associated index term is likely to be good at this.

The main problem with the cooccurrence approach was pointed out by Peat
and Willett, who claim that similar terms identified by cooccurrence also tend
to occur very frequently in the collection, and therefore these terms are
not good at discriminating between relevant and non-relevant documents.
This is true when the cooccurrence analysis is done on the whole collection,
but if we apply cooccurrence analysis only to the top ranked documents, the
problem raised by Peat and Willett is mitigated.

For our experiments we have used the well-known Cosine, Dice and Tanimoto
coefficients:

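\[ \mathrm{Tanimoto}(t_i, t_j) = \frac{c_{ij}}{c_i + c_j - c_{ij}} \tag{1} \]

\[ \mathrm{Dice}(t_i, t_j) = \frac{2\, c_{ij}}{c_i + c_j} \tag{2} \]

\[ \mathrm{Cosine}(t_i, t_j) = \frac{c_{ij}}{\sqrt{c_i \, c_j}} \tag{3} \]

where $c_i$ and $c_j$ are the number of documents in which terms $t_i$ and
$t_j$ occur, respectively, and $c_{ij}$ is the number of documents in which
$t_i$ and $t_j$ cooccur.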
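
As a minimal sketch of how the three coefficients can be computed from document counts, the following Python functions take the counts c_i, c_j and c_ij directly. The function names and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration, not part of the original system.

```python
import math


def tanimoto(c_i, c_j, c_ij):
    """Tanimoto coefficient: c_ij / (c_i + c_j - c_ij)."""
    denom = c_i + c_j - c_ij
    return c_ij / denom if denom else 0.0


def dice(c_i, c_j, c_ij):
    """Dice coefficient: 2 * c_ij / (c_i + c_j)."""
    denom = c_i + c_j
    return 2 * c_ij / denom if denom else 0.0


def cosine(c_i, c_j, c_ij):
    """Cosine coefficient: c_ij / sqrt(c_i * c_j)."""
    denom = math.sqrt(c_i * c_j)
    return c_ij / denom if denom else 0.0


def document_counts(docs, t_i, t_j):
    """Count c_i, c_j and c_ij over docs, each given as a set of terms."""
    c_i = sum(1 for d in docs if t_i in d)
    c_j = sum(1 for d in docs if t_j in d)
    c_ij = sum(1 for d in docs if t_i in d and t_j in d)
    return c_i, c_j, c_ij
```

For instance, `tanimoto(*document_counts(docs, "a", "b"))` would score the association of two hypothetical terms "a" and "b" over a list of top ranked documents.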

We apply these coefficients to measure the similarity between terms
represented by vectors. The result is a ranking of candidate terms in which
the most useful terms for expansion are at the top.

In the selection method the most likely terms are selected using the
following equation:

\[ \mathrm{rel}(q, t_e) = \sum_{t_i \in q} q_i \cdot \mathrm{ASS}(t_i, t_e) \tag{4} \]

where ASS is one of the cooccurrence coefficients (Tanimoto, Dice, or Cosine)
and $q_i$ is the weight of the query term $t_i$. Equation (4) boosts the
terms related to more terms of the original query.
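
The selection step can be sketched as follows, assuming that each top ranked document is available as a set of terms and that the query is a dictionary mapping each query term to its weight. The names `rank_candidates` and `tanimoto`, and the choice to exclude the original query terms from the candidate list, are illustrative assumptions.

```python
def tanimoto(c_i, c_j, c_ij):
    """Tanimoto coefficient from document counts (0.0 if undefined)."""
    denom = c_i + c_j - c_ij
    return c_ij / denom if denom else 0.0


def rank_candidates(query_weights, top_docs, assoc=tanimoto):
    """Rank candidate terms by the weighted sum of their association
    with every query term, computed over the top ranked documents."""
    # Document frequency of every term in the top ranked set.
    df = {}
    for doc in top_docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1

    scores = {}
    for t_e in df:
        if t_e in query_weights:  # assumption: query terms are not candidates
            continue
        score = 0.0
        for t_i, q_i in query_weights.items():
            c_ij = sum(1 for doc in top_docs if t_i in doc and t_e in doc)
            score += q_i * assoc(df.get(t_i, 0), df[t_e], c_ij)
        scores[t_e] = score

    # Best candidates first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A candidate that cooccurs with several query terms accumulates one contribution per query term, which is the boosting effect described above.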

The results obtained with each of these measures, presented in Section 6,
show that Tanimoto performs best.



3 Distribution Analysis Approaches 

One of the main approaches to query expansion is based on studying the
difference in term distribution between the whole collection and the subsets
of documents that can be relevant for the query. Terms with little
informative content are expected to have a similar distribution in any
document of the collection. In contrast, terms closely related to those of the
original query are expected to be more frequent in the top ranked set of
documents retrieved with the original query than in other subsets of the
collection.



J. Pérez-Agüera & L. Araujo



3.1 Information-theoretic approach 

One of the most interesting approaches based on term distribution analysis
was proposed by Carpineto et al. [?]. It uses the Kullback-Leibler
divergence (KLD) to compute the divergence between the probability
distributions of terms in the whole collection and in the top ranked documents
obtained in a first-pass retrieval using the original user query. The most
likely terms to expand the query are those with a high probability in the top
ranked set and a low probability in the whole collection. For a term t, this
divergence is computed as:

\[ \mathrm{KLD}_{(P_R, P_C)}(t) = P_R(t) \cdot \log \frac{P_R(t)}{P_C(t)} \tag{5} \]

where $P_R(t)$ is the probability of the term t in the top ranked documents,
and $P_C(t)$ is the probability of the term t in the whole collection.
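
A minimal sketch of this term scoring is given below. It assumes that both probabilities are estimated as relative term frequencies, that the natural logarithm is used, and that terms missing from the collection statistics are skipped; none of these choices is stated in the text, and the function and parameter names are hypothetical.

```python
import math
from collections import Counter


def kld_scores(top_docs, collection_tf, collection_len):
    """Score each term of the top ranked documents by its KLD contribution.

    top_docs: list of token lists (the top ranked documents).
    collection_tf: dict mapping a term to its frequency in the collection.
    collection_len: total number of tokens in the collection.
    """
    tf_r = Counter(token for doc in top_docs for token in doc)
    n_r = sum(tf_r.values())

    scores = {}
    for term, freq in tf_r.items():
        p_r = freq / n_r                                  # P_R(t), assumed estimate
        p_c = collection_tf.get(term, 0) / collection_len  # P_C(t), assumed estimate
        if p_c > 0:
            scores[term] = p_r * math.log(p_r / p_c)
    return scores
```

Under these assumptions, a term that is frequent in the top ranked set but rare in the collection receives a high score, while a term that is common everywhere receives a low or negative one.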



3.2 Divergence From Randomness term weighting model 

The Divergence From Randomness (DFR) [?] term weighting model infers the
informativeness of a term from the divergence between its distribution in the
top-ranked documents and a random distribution. The most effective DFR term
weighting model is the Bo1 model, which uses Bose-Einstein statistics [?,?]:

\[ w(t) = tf_x \cdot \log_2 \frac{1 + P_n}{P_n} + \log(1 + P_n) \tag{6} \]

where $tf_x$ is the frequency of the query term in the top-ranked documents
and $P_n$ is given by $F/N$, where F is the frequency of the query term in the
collection and N is the number of documents in the collection.
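
The Bo1 weight can be sketched in a few lines. The sketch assumes that both logarithms are taken in base 2, as in the usual formulation of Bo1, and that F is greater than zero; the function name is hypothetical.

```python
import math


def bo1_weight(tf_x, F, N):
    """Bo1 weight of a term.

    tf_x: frequency of the term in the top-ranked documents.
    F: frequency of the term in the whole collection (assumed > 0).
    N: number of documents in the collection.
    """
    p_n = F / N
    # Assumption: base-2 logarithm for both terms of the sum.
    return tf_x * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)
```

With the frequency in the top-ranked documents held fixed, the weight grows as the term becomes rarer in the collection, which is what makes the model penalize terms that are frequent everywhere.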



4 Combined query expansion method 

The two approaches tested in this work can complement each other because they
rely on different information. The performance of the cooccurrence approach
is reduced by words that are not stop-words but are very frequent in the
collection [?]. Those words, which represent a kind of noise, can reach a high
position in the ranking of candidate terms, thus worsening the expansion
process. However, precisely because of their high probability in any subset of
the document collection, these words tend to have a low score in KLD or Bo1.
Accordingly, combining the cooccurrence measures with others based on the
informative content of the terms, such as KLD or Bo1, helps to eliminate the
noisy terms, thus improving the information retrieved with the query expansion
process.

Our combined model amounts to applying both a cooccurrence method and a
distributional method, and then obtaining the list of candidate terms by
intersecting the lists provided by each method separately. Finally, the terms
of the resulting list are assigned a new weight by one of the reweighting
methods considered.

In the combined approach the number of selected terms depends on the overlap
between the term sets proposed by both approaches. To increase the
intersection and obtain enough candidate terms in the combined list, it is
necessary to increase the number of terms selected by the non-combined
approaches. This issue has been studied in the experiments.
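
The intersection step can be sketched as follows, assuming that each method returns its candidate terms ordered from best to worst and that the same cutoff k is applied to both lists. Keeping the cooccurrence order in the result is an arbitrary choice made for this illustration, and the function name is hypothetical.

```python
def combine_candidates(cooc_ranked, dist_ranked, k):
    """Intersect the top-k terms of a cooccurrence ranking with the
    top-k terms of a distributional ranking (KLD or Bo1).

    Both inputs are lists of terms ordered from best to worst.
    """
    top_dist = set(dist_ranked[:k])
    # Assumption: the result keeps the order of the cooccurrence ranking.
    return [term for term in cooc_ranked[:k] if term in top_dist]
```

With a small k the two lists may not overlap at all, which illustrates why more terms have to be selected by each individual method before intersecting.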



5 Methods for Reweighting the Expanded Query Terms 

After the list of candidate terms has been generated by one of the methods
described above, the selected terms that will be added to the query must be
reweighted. Different schemes have been proposed for this task. We have
compared these schemes and tested which is the most appropriate for each
expansion method and for our combined query expansion method.

The classical approach to term reweighting is the Rocchio algorithm [?].
In this work we have used Rocchio's beta formula, which requires only the
$\beta$ parameter, and computes the new weight qtw of the term in the query as:

\[ qtw = \frac{qtf}{qtf_{max}} + \beta \cdot \frac{w(t)}{w_{max}(t)} \tag{7} \]

where w(t) is the old weight of term t, $w_{max}(t)$ is the maximum w(t) of
the expanded query terms, $\beta$ is a parameter, qtf is the frequency of the
term t in the query and $qtf_{max}$ is the maximum term frequency in the query
q. In all our experiments, $\beta$ is set to 0.1.
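
A minimal sketch of this reweighting is shown below. It assumes that an expansion term that does not appear in the original query has qtf = 0, and that a query term without an expansion weight has w(t) = 0; the function and parameter names are hypothetical.

```python
def rocchio_beta(qtf, w, beta=0.1):
    """Reweight the expanded query with Rocchio's beta formula.

    qtf: dict mapping each original query term to its frequency in the query.
    w: dict mapping each expansion term to its weight w(t).
    """
    qtf_max = max(qtf.values())
    w_max = max(w.values())
    new_weights = {}
    for term in set(qtf) | set(w):
        # Assumption: missing entries count as zero in either dictionary.
        new_weights[term] = (qtf.get(term, 0) / qtf_max
                             + beta * w.get(term, 0.0) / w_max)
    return new_weights
```

For a hypothetical query {"madrid": 1} expanded with weights {"madrid": 2.0, "capital": 4.0}, the original term ends up near 1.05 and the new term near 0.1, so with a small beta the original query terms dominate the expanded query.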

We have also tested other reweighting schemes, each of which comes directly
from one of the proposed methods for candidate term selection. These
schemes use the ranking values obtained by applying the function defined by
each method. Each of them can only be applied to reweight terms selected with
the method it derives from, because these methods require data collected
during the selection process that are specific to each of them.

For the reweighting scheme derived from KLD, the new weight is
obtained directly by applying KLD to the candidate terms. Terms belonging to
the original query maintain their value [?].

For the scheme derived from the cooccurrence method, which we call SumASS,
the weights of the candidate terms are computed by:

\[ qtw = \frac{\mathrm{rel}(q, t_e)}{\sum_{t_i \in q} q_i} \tag{8} \]

where $\sum_{t_i \in q} q_i$ is the sum of the weights of the original terms [?].

Finally, for the reweighting scheme derived from the Bose-Einstein statistics,
a normalization of Bo1 that we call BoNorm, we have defined a simple function
based on the normalization of the values obtained by the Bose-Einstein
computation:

\[ qtw = \frac{Bo(t_e)}{\sum_{t \in cl} Bo(t)} \tag{9} \]

where $\sum_{t \in cl} Bo(t)$ is the sum of the Bose-Einstein values for all
terms included in the candidate list obtained by applying Bose-Einstein
statistics.
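
Both derived schemes amount to dividing a raw selection score by a sum, which can be sketched as follows. The function names are hypothetical, and the BoNorm numerator (the Bose-Einstein value of the candidate term itself) is an assumption inferred from the description of the scheme as a normalization.

```python
def sum_ass(rel_scores, query_weights):
    """SumASS: divide rel(q, t_e) by the sum of the original query weights."""
    total = sum(query_weights.values())
    return {term: score / total for term, score in rel_scores.items()}


def bo_norm(bo_scores):
    """BoNorm: divide each Bose-Einstein value by the sum over the
    candidate list (assumed form of the normalization)."""
    total = sum(bo_scores.values())
    return {term: score / total for term, score in bo_scores.items()}
```

Under this reading, the BoNorm weights of the candidate list always sum to one, whereas the SumASS weights depend on how strongly the candidates are associated with the query.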
6 Experiments 

The Lucene implementation of the vector space model has been used to build our
information retrieval system. Stemming and stopword removal have been applied
in the indexing and expansion processes. Evaluation is carried out on the
Spanish EFE94 corpus, which is part of the CLEF collection [?] (approximately
215K documents with an average length of 330 words and 352K unique index
terms), and on the 2001 Spanish topic set, with 100 topics corresponding to
the years 2001 and 2002, of which we only used the title (3.3 words long on
average).

We have used different measures to evaluate each method. Each of them 
provides a different estimation of the precision of the retrieved documents, which 
is the main parameter to optimize when doing query expansion, since recall is 
always improved by the query expansion process. The measures considered have 
been: 

- MAP (Mean Average Precision), the mean over topics of average precision.
  Average precision is the average of the precision values (the fraction of
  retrieved documents that are relevant) obtained for the top set of
  documents existing after each relevant document is retrieved. In this way
  MAP measures precision at all recall levels and provides a view of both
  aspects.

- GMAP, a variant of MAP that uses a geometric mean rather than an
  arithmetic mean to average individual topic results.

- Precision@X, which is the precision after X documents (whether relevant or
  non-relevant) have been retrieved. If X documents were not retrieved for a
  query, then all missing documents are assumed to be non-relevant.

- R-Precision, which measures precision after R documents have been
  retrieved, where R is the total number of relevant documents for a query.
  If R is greater than the number of documents retrieved for a query, then the
  non-retrieved documents are all assumed to be non-relevant.
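
The measures listed above can be sketched as follows for a single ranked list, with MAP and GMAP aggregating the per-topic average precision values. The function names are hypothetical, and the small floor `eps` used in the geometric mean to avoid taking the logarithm of zero is an assumption, not something stated in the text.

```python
import math


def precision_at(ranked, relevant, x):
    """Precision after x documents; missing documents count as non-relevant."""
    hits = sum(1 for doc in ranked[:x] if doc in relevant)
    return hits / x


def r_precision(ranked, relevant):
    """Precision after R documents, R being the number of relevant documents."""
    r = len(relevant)
    return precision_at(ranked, relevant, r) if r else 0.0


def average_precision(ranked, relevant):
    """Average of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_ap(ap_values):
    """MAP: arithmetic mean of the per-topic average precision values."""
    return sum(ap_values) / len(ap_values)


def gmap(ap_values, eps=1e-5):
    """GMAP: geometric mean of the per-topic values (eps floor is assumed)."""
    logs = [math.log(max(ap, eps)) for ap in ap_values]
    return math.exp(sum(logs) / len(logs))
```

Because the geometric mean is dominated by the smallest values, GMAP rewards improvements on poorly performing topics more than MAP does.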

First of all we have tested the different cooccurrence methods described
above. Table 1 shows the results obtained for the different measures
considered in this work. We can observe that Tanimoto provides the best
results for all the measures except P@10, and in that case the difference from
the result of Dice, which is the best, is very small (0.5520 vs. 0.5530). In
view of these results, we have selected the Tanimoto similarity function as
the cooccurrence method for the rest of the work.



6.1 Selecting the Reweighting Method 

The next set of experiments aimed to determine the most appropriate
reweighting method for each candidate term selection method. Table 2
shows the results of different reweighting methods (Rocchio and SumASS)
applied after selecting the candidate terms by the cooccurrence method. We can
observe that the results are quite similar for both reweighting methods, though
Rocchio is slightly better in MAP, GMAP and P@10.

            MAP     GMAP    R-PREC  P@5     P@10
Baseline    0.4006  0.1941  0.4044  0.5340  0.4670
Cosine      0.4698  0.2375  0.4530  0.6020  0.5510
Tanimoto    0.4831  0.2464  0.4623  0.6060  0.5520
Dice        0.4772  0.2447  0.4583  0.6020  0.5530

Table 1. Comparing different cooccurrence methods. The Baseline row corresponds to
the results of the query without expansion. P@5 stands for precision after the first five
documents retrieved, P@10 after the first ten, and R-PREC stands for R-precision.





              MAP      GMAP     R-PREC   P@5      P@10
Baseline      0.4006   0.1941   0.4044   0.5340   0.4670
CooRocchio    0.4831*  0.2464*  0.4623   0.6060   0.5520*
CooSumASS     0.4798   0.2386   0.4628*  0.6080*  0.5490

Table 2. Comparing different reweighting methods for cooccurrence. CooRocchio
corresponds to using cooccurrence as the term selection method and Rocchio as the
reweighting method. CooSumASS corresponds to using cooccurrence as the term selection
method and SumASS as the reweighting method. Best results are marked with an asterisk.



Table 3 shows the results of the two reweighting methods (Rocchio and
KLD) applied after selecting the candidate terms with KLD. The best results are
obtained using KLD as the reweighting method.





            MAP     GMAP    R-PREC  P@5     P@10
Baseline    0.4006  0.1941  0.4044  0.5340  0.4670
KLDRocchio  0.4788  0.2370  0.4450  0.5960  0.5480
KLDkld      0.4801  0.2376  0.4526  0.6080  0.5510

Table 3. Comparing different reweighting methods for KLD. KLDRocchio corresponds
to using KLD as term selection method and Rocchio as reweighting method. KLDkld
corresponds to using KLD as term selection method and KLD as reweighting method.
Best results appear in boldface.



Table ?? shows the results of the two reweighting methods (Rocchio and
BoNorm) applied after selecting the candidate terms with Bo1. In this case, the
best results are obtained using BoNorm as the reweighting method.

The results of this section show that the best reweighting method after
selecting terms by co-occurrence is Rocchio, while for the distribution analysis
methods the best reweighting is obtained with the scheme derived from the
selection method itself, though Rocchio also provides very close results in
all cases.

6.2 Parameter Study 

We have studied two parameters that are fundamental in query expansion, the
number of candidate terms to expand the query and the number of documents

Abstract. Query expansion is a well-known method to improve the
performance of information retrieval systems. In this work we have
tested different approaches to extract the candidate query terms from the
top-ranked documents returned by the first-pass retrieval. One of them is
the co-occurrence approach, based on measures of co-occurrence of the
candidate and the query terms in the retrieved documents. The other
one, the probabilistic approach, is based on the probability distribution
of terms in the collection and in the top-ranked set. We compare the
retrieval improvement obtained by expanding the query with terms obtained
with different methods belonging to both approaches. In addition, we have
developed a naive combination of both approaches, with which we have
obtained results that improve on those obtained with either of them separately.
This result confirms that the information provided by each approach is of
a different nature and can therefore be used in a combined manner.



1 Introduction 

The reformulation of user queries is a common technique in information
retrieval to bridge the gap between the original user query and the user's
information need. The most widely used technique for query reformulation is
query expansion, where the original user query is expanded with new terms
extracted from different sources. Queries submitted by users are usually very
short, and query expansion can complete the users' information need.

A very complete review of the classical techniques of query expansion was
done by Efthimiadis [?]. Different methods of query expansion have been used
to improve the retrieval performance in the different tracks of the Text
REtrieval Conference (TREC), especially in the Robust and HARD tracks but also
in the Web and Terabyte tracks.

The main problem of query expansion methods is that in some cases
the expansion process worsens the query performance. Improving the robustness
of query expansion has been the goal of many researchers in recent years,
and most proposed approaches use external collections, such as Web documents,
to extract candidate terms for the expansion. There are other methods
that extract the candidate terms from the same collection on which the search is
performed. Some of these methods are based on global analysis, where the list
of candidate terms is generated from the whole collection, but they are
computationally very expensive and their effectiveness is not better than that of
methods based on local analysis. We also use the same collection on which the
search is performed, but apply local query expansion, also known as
pseudo-feedback or blind feedback, which does not use the global collection or
external sources for the expansion. This approach was first proposed by Xu and
Croft and uses the documents retrieved for the original user query in a first
pass for the term extraction.

In this work we have tested different approaches to extract the candidate
terms from the top-ranked documents returned by the first-pass retrieval. There
are two main approaches to rank the terms extracted from the retrieved
documents. One of them is the co-occurrence approach, based on measures of
co-occurrence of the candidate and the query terms in the retrieved documents.
The other one, the probabilistic approach, is based on the probability
distribution of terms in the collection and in the top-ranked set. In this paper
we are interested in testing the different existing techniques to generate the
candidate term list. Our thesis is that the information obtained with the
co-occurrence methods is different from the information obtained with
probabilistic methods, and that these two kinds of information can be combined
to improve the performance of the query expansion process. Our main goal is to
compare the performance of the co-occurrence approach and the probabilistic
techniques and to study how to combine them to improve the query expansion
process.

After the term extraction step, the query expansion process requires a further
step: re-computing the weights of the query terms that will be used in
the search process. We present the results of combining different methods for
the term extraction and the reweighting steps.

Two important parameters have to be adjusted for the described process.
One of them is the number of documents retrieved in the first pass to be used for
the term extraction. The other one is the number of candidate terms that are
finally used to expand the original user query. We have performed experiments
to set both of them to their optimal values for each method considered.

The rest of the paper proceeds as follows: sections 2 and 3 describe the
co-occurrence and the probabilistic approaches, respectively; section 4 presents
our combined query expansion method; section 5 presents the different
reweighting methods considered to assign new weights to the query terms after
the expansion process; section 6 is devoted to the experiments performed to
evaluate the different expansion techniques separately and combined; and the
final section summarizes the main conclusions of this work.



2 Co-occurrence Methods

The methods based on term co-occurrence have been used since the 1970s to
identify some of the semantic relationships that exist among terms. In the first
works of K. Van Rijsbergen and K. Sparck Jones we find the idea of using
co-occurrence statistics to detect some kind of semantic similarity between terms
and using it to expand the user's queries. In fact, this idea is based on the
Association Hypothesis:

If an index term is good at discriminating relevant from non-relevant
documents then any closely associated index term is likely to be good at this.

The main problem with the co-occurrence approach was mentioned by Peat
and Willett, who claim that similar terms identified by co-occurrence also tend
to occur very frequently in the collection, and therefore these terms are not
good elements to discriminate between relevant and non-relevant documents.
This is true when the co-occurrence analysis is done on the whole collection,
but if we apply co-occurrence analysis only to the top-ranked documents the
problem raised by Peat and Willett is mitigated.

For our experiments we have used the well-known Cosine, Dice and Tanimoto
coefficients:

Tanimoto(t_i, t_j) = \frac{c_{ij}}{c_i + c_j - c_{ij}}    (1)

Dice(t_i, t_j) = \frac{2 c_{ij}}{c_i + c_j}    (2)

Cosine(t_i, t_j) = \frac{c_{ij}}{\sqrt{c_i c_j}}    (3)

where c_i and c_j are the number of documents in which terms t_i and t_j occur,
respectively, and c_{ij} is the number of documents in which t_i and t_j co-occur.
The results obtained with each of these measures, shown in section ??, show
that Tanimoto performs better.

We apply these coefficients to measure the similarity between terms
represented by the vectors. The result is a ranking of candidate terms where the
most useful terms for expansion are at the top.

In the selection method the most likely terms are selected using the following
equation:

rel(q, t_e) = \sum_{t_i \in q} q_i * ASS(t_i, t_e)    (4)

where q_i is the weight of the query term t_i and ASS is one of the co-occurrence
coefficients: Tanimoto, Dice, or Cosine. Equation (4) boosts the terms related
to more terms of the original query.
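The selection step of this section can be illustrated with a minimal sketch. It assumes that each top-ranked document is available as a set of terms and that the query is a dictionary from term to weight; the function names and these data structures are hypothetical, chosen only for illustration, and are not taken from the system used in the experiments.

```python
import math

def tanimoto(c_i, c_j, c_ij):
    """Equation (1): c_ij / (c_i + c_j - c_ij)."""
    denom = c_i + c_j - c_ij
    return c_ij / denom if denom else 0.0

def dice(c_i, c_j, c_ij):
    """Equation (2): 2 * c_ij / (c_i + c_j)."""
    denom = c_i + c_j
    return 2 * c_ij / denom if denom else 0.0

def cosine(c_i, c_j, c_ij):
    """Equation (3): c_ij / sqrt(c_i * c_j)."""
    denom = math.sqrt(c_i * c_j)
    return c_ij / denom if denom else 0.0

def rank_candidates(query_weights, docs, assoc=tanimoto):
    """Rank candidates by rel(q, t_e) = sum_i q_i * ASS(t_i, t_e), Equation (4).

    query_weights: dict term -> weight q_i (assumed representation).
    docs: list of sets of terms, one set per top-ranked document.
    """
    # Document frequency of every term inside the top-ranked set.
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1

    scores = {}
    for cand in df:
        if cand in query_weights:
            continue  # only new terms are candidates for expansion
        total = 0.0
        for q_term, q_w in query_weights.items():
            c_ij = sum(1 for doc in docs if q_term in doc and cand in doc)
            total += q_w * assoc(df.get(q_term, 0), df[cand], c_ij)
        scores[cand] = total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because the counts are taken only over the top-ranked documents, terms that are frequent across the whole collection but unrelated to the query gain less than they would under a global analysis.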

3 Distribution Analysis Approaches 

One of the main approaches to query expansion is based on studying the
difference in term distribution between the whole collection and the subsets of
documents that can be relevant for the query. It is expected that terms with
little informative content have a similar distribution in any document of the
collection. By contrast, terms closely related to those of the original query are
expected to be more frequent in the top-ranked set of documents retrieved with
the original query than in other subsets of the collection.

3.1 Information-theoretic approach

One of the most interesting approaches based on term distribution analysis has
been proposed by Carpineto et al. [?], and uses the concept of the
Kullback-Leibler divergence to compute the divergence between the probability
distributions of terms in the whole collection and in the top-ranked documents
obtained in a first-pass retrieval using the original user query. The most
likely terms to expand the query are those with a high probability in the
top-ranked set and a low probability in the whole collection. This divergence is
computed as:

KLD_{(P_R, P_C)}(t) = P_R(t) \log \frac{P_R(t)}{P_C(t)}    (5)

where P_R(t) is the probability of the term t in the top-ranked documents,
and P_C(t) is the probability of the term t in the whole collection.
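As an illustration of Equation (5), the following sketch scores the terms of the top-ranked documents. It assumes that both probabilities are estimated by relative term frequency and that terms without collection statistics are skipped; both choices, as well as the function and parameter names, are assumptions made for this example.

```python
import math
from collections import Counter

def kld_scores(top_docs, collection_tf, collection_len):
    """Score terms with P_R(t) * log(P_R(t) / P_C(t)), Equation (5).

    top_docs: list of token lists for the top-ranked documents.
    collection_tf: dict term -> frequency in the whole collection.
    collection_len: total number of tokens in the collection.
    """
    top_tf = Counter(tok for doc in top_docs for tok in doc)
    top_len = sum(top_tf.values())
    scores = {}
    for term, tf in top_tf.items():
        p_r = tf / top_len                                  # P_R(t)
        p_c = collection_tf.get(term, 0) / collection_len   # P_C(t)
        if p_c == 0:
            continue  # assumption: ignore terms with no collection statistics
        scores[term] = p_r * math.log(p_r / p_c)
    return scores
```

A term scores high when it is frequent in the top-ranked set and rare in the collection, which matches the selection criterion stated above.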



3.2 Divergence From Randomness term weighting model

The Divergence From Randomness (DFR) [?] term weighting model infers the
informativeness of a term from the divergence between its distribution in the
top-ranked documents and a random distribution. The most effective DFR term
weighting model is the Bo1 model, which uses Bose-Einstein statistics [?,?]:

w(t) = tf_x * \log_2 \frac{1 + P_n}{P_n} + \log(1 + P_n)    (6)

where tf_x is the frequency of the query term in the top-ranked documents and
P_n is given by F/N, where F is the frequency of the query term in the
collection and N is the number of documents in the collection.
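Equation (6) translates directly into a few lines of code. The sketch below takes both logarithms in base 2, which is an assumption, since the base of the second logarithm is not explicit in the formula; the function name and the guard against zero frequencies are also illustrative choices.

```python
import math

def bo1_weight(tf_x, coll_freq, num_docs):
    """Bo1 weight of a term, following Equation (6).

    tf_x: frequency of the term in the top-ranked documents.
    coll_freq: frequency F of the term in the collection.
    num_docs: number of documents N in the collection.
    """
    if coll_freq <= 0 or num_docs <= 0:
        raise ValueError("coll_freq and num_docs must be positive")
    p_n = coll_freq / num_docs  # P_n = F / N
    # Assumption: base 2 is used for both logarithms.
    return tf_x * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)
```

For a fixed frequency in the top-ranked documents, the weight grows as the term becomes rarer in the collection.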



4 Combined query expansion method 

The two approaches tested in this work can complement each other because they
rely on different information. Specifically, the performance of the co-occurrence
approach can be reduced by words that are not stopwords but are very
frequent in the collection. Those words, which represent a kind of noise, can
probably reach a high position in the term index, thus worsening the expansion
process. However, this kind of word will, in general, have a low score in KLD or
Bo1, precisely because of its high probability in any set of the document
collection. Accordingly, combining the co-occurrence measures with others based
on the informative content of the terms, such as KLD or Bo1, helps to eliminate
the noise terms, thus improving the query expansion process and hence the
retrieved information.

In the combined approach, the number of selected terms depends on the overlap
between the terms proposed by both approaches.
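One simple way to realize such a combination is sketched below. It assumes that the combination keeps only the terms that appear among the first n candidates of both rankings; this intersection rule, the cut-off n, and the function name are assumptions made for illustration, since the text only states that the number of selected terms depends on the overlap.

```python
def combine_candidates(cooc_ranked, prob_ranked, n=40):
    """Keep the terms proposed by both approaches.

    cooc_ranked: terms ranked by a co-occurrence measure, best first.
    prob_ranked: terms ranked by KLD or Bo1, best first.
    n: how many top candidates of each ranking are considered (assumed value).
    """
    prob_top = set(prob_ranked[:n])
    # Preserve the co-occurrence order among the overlapping terms.
    return [term for term in cooc_ranked[:n] if term in prob_top]
```

With this rule, a frequent noise term that ranks high by co-occurrence but low by KLD or Bo1 is dropped, and the size of the final list varies from query to query.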
5 Methods for Reweighting the Expanded Query Terms

After the candidate list has been generated by the methods shown above, the
selected terms that will be added to the query must be reweighted. Different
schemes have been proposed for this task. We have compared these schemes and
tested which is the most appropriate for each expansion method and for
our combined query expansion method.

The classical approach to term reweighting is the Rocchio algorithm [?].
In this work we have used Rocchio's beta formula, which requires only the \beta
parameter, and computes the new weight qtw of the term in the query as:

qtw = \frac{qtf}{qtf_{max}} + \beta * \frac{w(t)}{w_{max}(t)}    (7)

where w(t) is the old weight of term t, w_{max}(t) is the maximum w(t) of the
expanded query terms, \beta is a parameter, qtf is the frequency of the term t in
the query and qtf_{max} is the maximum term frequency in the query q. In all our
experiments, \beta is set to 0.1.
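Equation (7) can be sketched as follows. The dictionary-based representation of the query and of the expansion term weights, and the function name, are assumptions made for this example.

```python
def rocchio_beta(query_tf, expansion_weights, beta=0.1):
    """Reweight terms with qtw = qtf/qtf_max + beta * w(t)/w_max, Equation (7).

    query_tf: dict term -> frequency qtf in the original query.
    expansion_weights: dict term -> weight w(t) from the selection method.
    beta: Rocchio's beta parameter (0.1 in the experiments).
    """
    qtf_max = max(query_tf.values(), default=1)
    w_max = max(expansion_weights.values(), default=0.0)
    new_weights = {}
    for term in set(query_tf) | set(expansion_weights):
        q_part = query_tf.get(term, 0) / qtf_max
        w_part = (expansion_weights.get(term, 0.0) / w_max) if w_max else 0.0
        new_weights[term] = q_part + beta * w_part
    return new_weights
```

With beta set to 0.1, a term that appears only in the expansion list receives at most a weight of 0.1, so the original query terms keep dominating the expanded query.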

We have also tested other reweighting schemes, each of which comes directly
from one of the proposed methods for candidate term selection. These schemes
use the ranking values obtained by applying the functions defined by each method.
Each of them can only be applied to reweight terms selected with the method
it derives from, because these schemes require data collected during the
selection process, which are specific to each method.

For the reweighting scheme derived from KLD, the new weight is
obtained directly by applying KLD to the candidate terms. Terms belonging to the
original query maintain their value [?].

For the scheme derived from the co-occurrence method, which we call SumASS,
the weights of the candidate terms are computed by:

qtw = \frac{rel(q, t_e)}{\sum_{t_i \in q} q_i}    (8)

where \sum_{t_i \in q} q_i is the sum of the weights of the original terms [?].

Finally, for the reweighting scheme derived from the Bose-Einstein statistics,
a normalization of Bo1 that we call BoNorm, we have defined a simple function
based on the normalization of the values obtained by the Bose-Einstein
computation:

qtw = \frac{Bo(t_c)}{\sum_{t \in cl} Bo(t)}    (9)

where Bo(t_c) is the Bose-Einstein value of the candidate term t_c and
\sum_{t \in cl} Bo(t) is the sum of the Bose-Einstein values for all terms
included in the candidate list cl obtained by applying Bose-Einstein statistics.
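Both derived schemes of Equations (8) and (9) are normalizations of scores already computed during term selection, which the following sketch makes explicit. The function names and the dictionary inputs are hypothetical, and returning zero weights when the normalizing sum is zero is an assumption added to keep the example safe to run.

```python
def sum_ass_weights(rel_scores, query_weights):
    """SumASS, Equation (8): rel(q, t_e) divided by the sum of the query weights.

    rel_scores: dict candidate term -> rel(q, t_e) from Equation (4).
    query_weights: dict original query term -> weight q_i.
    """
    total = sum(query_weights.values())
    if total == 0:
        return {term: 0.0 for term in rel_scores}
    return {term: score / total for term, score in rel_scores.items()}

def bo_norm_weights(bo_scores):
    """BoNorm, Equation (9): each Bo1 value divided by the sum over the list.

    bo_scores: dict candidate term -> Bo1 value for the candidate list.
    """
    total = sum(bo_scores.values())
    if total == 0:
        return {term: 0.0 for term in bo_scores}
    return {term: score / total for term, score in bo_scores.items()}
```

Since each function needs the scores produced by its own selection method, neither can be applied to terms selected in a different way, which is the restriction noted above.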



6 Experiments 



The Lucene Vector Space Model implementation has been used to build our
information retrieval system. Stemming and stopword removal have been applied in
the indexing and expansion processes. Evaluation is carried out on the Spanish
EFE94 corpus, which is part of the CLEF collection [?] (approximately 215K
documents with an average length of 330 words and 352K unique index terms), and
on the 2001 Spanish topic set, with 100 topics corresponding to the years 2001
and 2002, of which we only used the title (average length of 3.3 words).

We have used different measures to evaluate each method. Each of them
provides a different estimation of the precision of the retrieved documents, which
is the main parameter to optimize when doing query expansion, since recall is
always improved by the query expansion process. The measures considered have
been:

— MAP (Mean Average Precision), which is the average of the precision
(percentage of retrieved documents that are relevant) values obtained for the
top set of documents existing after each relevant document is retrieved. In
this way MAP measures precision at all recall levels and thus provides a view
of both aspects.

— GMAP, a variant of MAP, that uses a geometric mean rather than an arith- 
metic mean to average individual topic results. 

— Precision@X, precision after X documents (whether relevant or non-relevant) 
have been retrieved. Values averaged over all queries. If X docs were not 
retrieved for a query, then all missing docs are assumed to be non-relevant. 

— R-Precision, which measures precision after R docs have been retrieved, 
where R is the total number of relevant docs for a query. If R is greater 
than the number of docs retrieved for a query, then the non-retrieved docs 
are all assumed to be non-relevant. 
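The measures listed above can be written down compactly. This is a minimal sketch that assumes a ranking is a list of document identifiers and the relevance judgments are a set; the small floor used in the geometric mean to avoid the logarithm of zero is an assumption of this example, as are the function names.

```python
from math import exp, log

def precision_at(ranking, relevant, k):
    """Precision after k documents; missing documents count as non-relevant."""
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k

def r_precision(ranking, relevant):
    """Precision after R documents, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at(ranking, relevant, r) if r else 0.0

def average_precision(ranking, relevant):
    """Average of the precision values observed at each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_ap(aps):
    """MAP: arithmetic mean of the per-topic average precision values."""
    return sum(aps) / len(aps)

def geometric_map(aps, eps=1e-5):
    """GMAP: geometric mean of the per-topic values, floored at eps (assumed)."""
    return exp(sum(log(max(ap, eps)) for ap in aps) / len(aps))
```

The geometric mean penalizes topics with very low average precision much more than the arithmetic mean does, which is why GMAP is informative about the robustness of an expansion method.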







Hamiltonian Mechanics2 



Ivar Ekcland 1 and Roger Temam 2 



1 Princeton University, Princeton NJ 08544, USA 
2 Universite de Paris-Sud, Laboratoire d'Analyse Numerique, Batiment 425, 
F-91405 Orsay Cedex, France 



Abstract. The abstract should summarize the contents of the paper 
using at least 70 and at most 150 words. It will be set in 9-point font 
size and be inset 1.0 cm from the right and left margins. There will be 
two blank lines before and after the Abstract. . . . 

1 Fixed-Period Problems: The Sublinear Case 

With this chapter, the preliminaries are over, and we begin the search for periodic 
solutions to Hamiltonian systems. All this will be done in the convex case; that 
is, we shall study the boundary-value problem 

x = JH'(t,x) 
x(0) = x(T) 

with H(t, •) a convex function of x, going to +oo when ||x|| — > oo. 
1.1 Autonomous Systems 

In this section, we will consider the case when the Hamiltonian H(x) is au- 
tonomous. For the sake of simplicity, we shall also assume that it is C . 

We shall first consider the question of nontriviality, within the general frame- 
work of (Am, -Boo)-subquadratic Hamiltonians. In the second subsection, we shall 
look into the special case when H is (0, 6 oc )-subquadratic, and we shall try to 
derive additional information. 

The General Case: Nontriviality. We assume that H is {A^, B^-sub- 
quadratic at infinity, for some constant symmetric matrices A^ and B^, with 
Bqc — Aoo positive definite. Set: 



7 : = smallest eigenvalue of B^ — A, 



(1) 



A : = largest negative eigenvalue of J— + A, 



dt 



(2) 



Theorem 21 tells us that if A + 7 < 0, the boundary-value problem: 



x = JH'(x) 
x(0) = x(T) 



(3) 



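has at least one solution $\overline{x}$, which is found by minimizing the dual action
functional:

$$\psi(u) = \int_0^T \left[ \frac{1}{2}\left(\Lambda^{-1}u, u\right) + N^*(-u) \right] dt \tag{4}$$

on the range of $\Lambda$, which is a subspace $R(\Lambda)_{L^2}$ with finite codimension. Here

$$N(x) := H(x) - \frac{1}{2}\left(A_\infty x, x\right) \tag{5}$$

is a convex function, and

$$N(x) \le \frac{1}{2}\left(\left(B_\infty - A_\infty\right)x, x\right) + c \qquad \forall x. \tag{6}$$

Proposition 1. Assume $H'(0) = 0$ and $H(0) = 0$. Set:

$$\delta := \liminf_{x \to 0} 2N(x)\,\|x\|^{-2}. \tag{7}$$

If $\gamma < -\lambda < \delta$, the solution $\overline{u}$ is non-zero:

$$\overline{x}(t) \ne 0 \qquad \forall t. \tag{8}$$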



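Proof. Condition (7) means that, for every $\delta' > \delta$, there is some $\varepsilon > 0$ such that

$$\|x\| \le \varepsilon \Rightarrow N(x) \le \frac{\delta'}{2}\|x\|^2. \tag{9}$$

It is an exercise in convex analysis, into which we shall not go, to show that
this implies that there is an $\eta > 0$ such that

$$\|y\| \le \eta \Rightarrow N^*(y) \le \frac{1}{2\delta'}\|y\|^2. \tag{10}$$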



Fig. 1. This is the caption of the figure displaying a white eagle and a white horse on 
a snow field 

Since $u_1$ is a smooth function, we will have $\|hu_1\|_\infty \le \eta$ for $h$ small enough,
and inequality (10) will hold, yielding thereby:

$$\psi(hu_1) \le \frac{h^2}{2}\frac{1}{\lambda}\|u_1\|_2^2 + \frac{h^2}{2}\frac{1}{\delta'}\|u_1\|_2^2. \tag{11}$$

If we choose $\delta'$ close enough to $\delta$, the quantity $\left(\frac{1}{\lambda} + \frac{1}{\delta'}\right)$ will be negative, and
we end up with

$$\psi(hu_1) < 0 \quad \text{for } h \ne 0 \text{ small}. \tag{12}$$

On the other hand, we check directly that $\psi(0) = 0$. This shows that 0 cannot
be a minimizer of $\psi$, not even a local one. So $\overline{u} \ne 0$ and $\overline{u} \ne \Lambda^{-1}(0) = 0$. □

Corollary 1. Assume $H$ is $C^2$ and $(a_\infty, b_\infty)$-subquadratic at infinity. Let
$\xi_1, \dots, \xi_N$ be the equilibria, that is, the solutions of $H'(\xi) = 0$. Denote by $\omega_k$ the
smallest eigenvalue of $H''(\xi_k)$, and set:

$$\omega := \min\{\omega_1, \dots, \omega_N\}. \tag{13}$$

If:

$$\frac{T}{2\pi} b_\infty < -E\left[-\frac{T}{2\pi} a_\infty\right] < \frac{T}{2\pi}\omega \tag{14}$$

then minimization of $\psi$ yields a non-constant $T$-periodic solution $\overline{x}$.



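We recall once more that by the integer part $E[\alpha]$ of $\alpha \in \mathbb{R}$, we mean the
$a \in \mathbb{Z}$ such that $a < \alpha \le a + 1$. For instance, if we take $a_\infty = 0$, Corollary 2
tells us that $\overline{x}$ exists and is non-constant provided that:

$$\frac{T}{2\pi} b_\infty < 1 < \frac{T}{2\pi}\omega \tag{15}$$

or

$$T \in \left(\frac{2\pi}{\omega}, \frac{2\pi}{b_\infty}\right). \tag{16}$$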
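
As a rough numerical illustration of conditions (14) and (16), the sketch below evaluates the integer part in the convention used here ($a < \alpha \le a + 1$, which equals the ceiling minus one) and tests a few periods. The function names and the sample values of $b_\infty$ and $\omega$ are assumptions chosen for illustration only; they do not come from the text.

```python
import math


def integer_part(alpha: float) -> int:
    """E[alpha] in the text's convention: the integer a with a < alpha <= a + 1."""
    return math.ceil(alpha) - 1


def condition_14(T: float, a_inf: float, b_inf: float, omega: float) -> bool:
    """Check (T/2pi) b_inf < -E[-(T/2pi) a_inf] < (T/2pi) omega."""
    scale = T / (2.0 * math.pi)
    middle = -integer_part(-scale * a_inf)
    return scale * b_inf < middle < scale * omega


def period_window(b_inf: float, omega: float) -> tuple:
    """Window (16), valid for a_inf = 0: 2pi/omega < T < 2pi/b_inf."""
    return 2.0 * math.pi / omega, 2.0 * math.pi / b_inf


if __name__ == "__main__":
    b_inf, omega = 1.0, 4.0  # hypothetical sample values, not from the text
    low, high = period_window(b_inf, omega)
    for T in (0.5 * low, 0.5 * (low + high), 2.0 * high):
        in_window = low < T < high
        holds = condition_14(T, 0.0, b_inf, omega)
        print(f"T = {T:.3f}: in window (16): {in_window}, condition (14): {holds}")
```

For $a_\infty = 0$ the middle term of (14) is $-E[0] = 1$, so the two checks should agree for every period tried.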



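Proof. The spectrum of $\Lambda$ is $\frac{2\pi}{T}\mathbb{Z} + a_\infty$. The largest negative eigenvalue $\lambda$ is
given by $\frac{2\pi}{T}k_o + a_\infty$, where

$$\frac{2\pi}{T}k_o + a_\infty < 0 \le \frac{2\pi}{T}(k_o + 1) + a_\infty. \tag{17}$$

Hence:

$$k_o = E\left[-\frac{T}{2\pi}a_\infty\right]. \tag{18}$$

The condition $\gamma < -\lambda < \delta$ now becomes:

$$b_\infty - a_\infty < -\frac{2\pi}{T}k_o - a_\infty < \omega - a_\infty \tag{19}$$

which is precisely condition (14). □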
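
The formula for the largest negative eigenvalue used in this proof can be sanity-checked by brute force. The sketch below scans the eigenvalues $\frac{2\pi}{T}k + a_\infty$ over a finite range of integers $k$ and compares the result with the closed form $\frac{2\pi}{T}k_o + a_\infty$, $k_o = E[-\frac{T}{2\pi}a_\infty]$. The function names, the search range, and the sample values are assumptions made for illustration; the scan is only valid when the true $k_o$ lies inside the search range.

```python
import math


def largest_negative_eigenvalue(T: float, a_inf: float, search: int = 1000) -> float:
    """Brute-force scan of (2*pi/T)*k + a_inf over |k| <= search, keeping negatives."""
    step = 2.0 * math.pi / T
    negatives = [step * k + a_inf
                 for k in range(-search, search + 1)
                 if step * k + a_inf < 0]
    return max(negatives)


def closed_form(T: float, a_inf: float) -> float:
    """lambda = (2*pi/T)*k_o + a_inf, with k_o = E[-(T/2pi) a_inf] = ceil(.) - 1."""
    k_o = math.ceil(-T * a_inf / (2.0 * math.pi)) - 1
    return 2.0 * math.pi / T * k_o + a_inf


if __name__ == "__main__":
    # Hypothetical (T, a_inf) pairs, chosen away from the boundary case
    # where -(T/2pi) a_inf is an exact integer.
    for T, a_inf in [(1.0, 0.5), (10.0, -2.0), (3.0, 0.0)]:
        brute = largest_negative_eigenvalue(T, a_inf)
        exact = closed_form(T, a_inf)
        print(f"T = {T}, a_inf = {a_inf}: scan = {brute:.6f}, formula = {exact:.6f}")
```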



Lemma 1. Assume that $H$ is $C^2$ on $\mathbb{R}^{2n} \setminus \{0\}$ and that $H''(x)$ is non-degenerate
for any $x \ne 0$. Then any local minimizer $\widetilde{x}$ of $\psi$ has minimal period $T$.

Proof. We know that $\widetilde{x}$, or $\widetilde{x} + \xi$ for some constant $\xi \in \mathbb{R}^{2n}$, is a $T$-periodic
solution of the Hamiltonian system:

$$\dot{x} = JH'(x). \tag{20}$$

There is no loss of generality in taking $\xi = 0$. So $\psi(x) \ge \psi(\widetilde{x})$ for all $x$ in
some neighbourhood of $\widetilde{x}$ in $W^{1,2}\left(\mathbb{R}/T\mathbb{Z}; \mathbb{R}^{2n}\right)$.

But this index is precisely the index $i_T(\widetilde{x})$ of the $T$-periodic solution $\widetilde{x}$ over
the interval $(0,T)$, as defined in Sect. 2.6. So

$$i_T(\widetilde{x}) = 0. \tag{21}$$

Now if $\widetilde{x}$ has a lower period, $T/k$ say, we would have, by Corollary 31:

$$i_T(\widetilde{x}) = i_{kT/k}(\widetilde{x}) \ge k\,i_{T/k}(\widetilde{x}) + k - 1 \ge k - 1 \ge 1. \tag{22}$$

This would contradict (21), and thus cannot happen. □

Notes and Comments. The results in this section are a refined version of 1980; 
the minimality result of Proposition 14 was the first of its kind. 

To understand the nontriviality conditions, such as the one in formula (16),
one may think of a one-parameter family $x_T$, $T \in \left(2\pi\omega^{-1}, 2\pi b_\infty^{-1}\right)$, of periodic
solutions, $x_T(0) = x_T(T)$, with $x_T$ going away to infinity when $T \to 2\pi\omega^{-1}$,
which is the period of the linearized system at 0.



Table 1. This is the example table taken out of The TeXbook, p. 246

Year         World population
8000 B.C.           5,000,000
  50 A.D.         200,000,000
1650 A.D.         500,000,000
1945 A.D.       2,300,000,000
1980 A.D.       4,400,000,000



Theorem 1 (Ghoussoub-Preiss). Assume $H(t,x)$ is $(0,\varepsilon)$-subquadratic at
infinity for all $\varepsilon > 0$, and $T$-periodic in $t$:

$$H(t,\cdot) \text{ is convex } \forall t \tag{23}$$

$$H(\cdot,x) \text{ is } T\text{-periodic } \forall x \tag{24}$$

$$H(t,x) \ge n(\|x\|) \text{ with } n(s)s^{-1} \to \infty \text{ as } s \to \infty \tag{25}$$

$$\forall \varepsilon > 0,\ \exists c : H(t,x) \le \frac{\varepsilon}{2}\|x\|^2 + c. \tag{26}$$

Assume also that $H$ is $C^2$, and $H''(t,x)$ is positive definite everywhere. Then
there is a sequence $x_k$, $k \in \mathbb{N}$, of $kT$-periodic solutions of the system

$$\dot{x} = JH'(t,x) \tag{27}$$

such that, for every $k \in \mathbb{N}$, there is some $p_o \in \mathbb{N}$ with:

$$p \ge p_o \Rightarrow x_{pk} \ne x_k. \tag{28}$$

□

Example 1 (External forcing). Consider the system:

$$\dot{x} = JH'(x) + f(t) \tag{29}$$

where the Hamiltonian $H$ is $(0, b_\infty)$-subquadratic, and the forcing term is a
distribution on the circle:

$$f = \frac{d}{dt}F + f_o \quad \text{with } F \in L^2\left(\mathbb{R}/T\mathbb{Z}; \mathbb{R}^{2n}\right), \tag{30}$$

where $f_o := T^{-1}\int_0^T f(t)\,dt$. For instance,

$$f(t) = \sum_{k \in \mathbb{N}} \delta_k \xi, \tag{31}$$

where $\delta_k$ is the Dirac mass at $t = k$ and $\xi \in \mathbb{R}^{2n}$ is a constant, fits the
prescription. This means that the system $\dot{x} = JH'(x)$ is being excited by a series
of identical shocks at interval $T$.

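Definition 1. Let $A_\infty(t)$ and $B_\infty(t)$ be symmetric operators in $\mathbb{R}^{2n}$, depending
continuously on $t \in [0,T]$, such that $A_\infty(t) \le B_\infty(t)$ for all $t$.

A Borelian function $H : [0,T] \times \mathbb{R}^{2n} \to \mathbb{R}$ is called $(A_\infty, B_\infty)$-subquadratic
at infinity if there exists a function $N(t,x)$ such that:

$$H(t,x) = \frac{1}{2}\left(A_\infty(t)x, x\right) + N(t,x) \tag{32}$$

$$\forall t, \quad N(t,x) \text{ is convex with respect to } x \tag{33}$$

$$N(t,x) \ge n(\|x\|) \text{ with } n(s)s^{-1} \to +\infty \text{ as } s \to +\infty \tag{34}$$

$$\exists c \in \mathbb{R} : \quad H(t,x) \le \frac{1}{2}\left(B_\infty(t)x, x\right) + c \quad \forall x. \tag{35}$$

If $A_\infty(t) = a_\infty I$ and $B_\infty(t) = b_\infty I$, with $a_\infty \le b_\infty \in \mathbb{R}$, we shall say that
$H$ is $(a_\infty, b_\infty)$-subquadratic at infinity. As an example, the function $\|x\|^\alpha$, with
$1 < \alpha < 2$, is $(0,\varepsilon)$-subquadratic at infinity for every $\varepsilon > 0$. Similarly, the
Hamiltonian

$$H(t,x) = \frac{1}{2}k\|x\|^2 + \|x\|^\alpha \tag{36}$$

is $(k, k+\varepsilon)$-subquadratic for every $\varepsilon > 0$. Note that, if $k < 0$, it is not convex.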
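
The upper bound (35) for the example $\|x\|^\alpha$ can be made explicit. Taking $A_\infty = 0$ and $B_\infty = \varepsilon I$, one needs a constant $c$ with $s^\alpha \le \frac{\varepsilon}{2}s^2 + c$ for all $s = \|x\| \ge 0$. Maximizing $s^\alpha - \frac{\varepsilon}{2}s^2$ gives the maximizer $s^* = (\alpha/\varepsilon)^{1/(2-\alpha)}$, and the smallest admissible $c$ is the value at $s^*$. The sketch below computes that constant and spot-checks the inequality; the function names and the sample radii are assumptions made for illustration only.

```python
def growth_constant(alpha: float, eps: float) -> float:
    """Smallest c with s**alpha <= (eps/2)*s**2 + c for all s >= 0 (needs 1 < alpha < 2)."""
    if not (1.0 < alpha < 2.0) or eps <= 0.0:
        raise ValueError("expected 1 < alpha < 2 and eps > 0")
    # Maximiser of s**alpha - (eps/2)*s**2, from alpha*s**(alpha-1) = eps*s.
    s_star = (alpha / eps) ** (1.0 / (2.0 - alpha))
    return s_star ** alpha - 0.5 * eps * s_star ** 2


def satisfies_bound(alpha: float, eps: float, radii, tol: float = 1e-9) -> bool:
    """Spot-check s**alpha <= (eps/2)*s**2 + c on the given sample radii."""
    c = growth_constant(alpha, eps)
    return all(s ** alpha <= 0.5 * eps * s ** 2 + c + tol for s in radii)


if __name__ == "__main__":
    alpha, eps = 1.5, 0.1  # hypothetical sample values, not from the text
    print("c =", growth_constant(alpha, eps))
    print("bound holds on samples:",
          satisfies_bound(alpha, eps, [0.0, 0.5, 1.0, 10.0, 225.0, 1e3, 1e6]))
```

The constant grows without bound as $\varepsilon \to 0$, which is consistent with the definition allowing $c$ to depend on $\varepsilon$.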

Notes and Comments. The first results on subharmonics were obtained by Rabinowitz
in 1985, who showed the existence of infinitely many subharmonics both
in the subquadratic and superquadratic case, with suitable growth conditions
on $H'$. Again the duality approach enabled Clarke and Ekeland in 1981 to treat
the same problem in the convex-subquadratic case, with growth conditions on
$H$ only.

Recently, Michalek and Tarantello (see Michalek, R., Tarantello, G. 1982 and
Tarantello, G. 1983) have obtained lower bounds on the number of subharmonics
of period $kT$, based on symmetry considerations and on pinching estimates, as
in Sect. 5.2 of this article.



